(54) Method and apparatus for comparison of structured documents



(57) A document difference extraction method and 
apparatus which is used for extracting the difference 
between structured documents properly meeting the 
sense of a document editor taking the logical meaning 
and structure of the structured documents into consid- 
eration. Structured documents are edited and stored in 
a memory unit by a document editing program (104). 
With reference to a comparison criterion (107) set for 
the logical structure of each structured document before 
and after edition, the logical structure of the structured
documents before and after edition read from the memory
unit is analyzed by a structured document parsing
program (105), and the difference between the structured
documents is extracted by a structured document
difference extraction program (106) in such a manner as
to satisfy the comparison criterion in accordance with
the result of parsing. The comparison criterion (107)
assumes the form of a table containing a plurality of
tags representing logical structures and types of tags for
the comparison criterion. The tag types for comparison
criterion include tags having contents which are com- 
pared only when the particular tags are coincident with 
each other, tags having contents which are ignored at 
the time of comparison, a set of tags having the same 
logical meaning, and a set of tags having contents 
which are not compared with each other. 
Description

BACKGROUND OF THE INVENTION 

The present invention relates to a structured docu- 
ment difference string extraction method and apparatus 
for a document processor such as a word processor 
capable of extracting a difference character string 
between structured documents stored as an electronic 
file. 

A structured document is defined as one having
embedded therein, i.e., containing, information on the
logical structure of the document, that is, information such
as "this portion of the document constitutes a chapter"
or "this portion makes up a title".

The difference extraction between documents is 
defined as detecting a most coincident combination of 
elements constituting each document including para- 
graphs, lines and characters and extracting non-coinci- 
dent elements as a difference. Suppose that two 
documents for which the difference is to be detected are 
"ABCDEFG" and "ACDAEFH". When the two docu- 
ments are compared in terms of elements thereof 
including A, B, C, D, E, F, G and H, the most coincident 
combination is detected as "correspondence of 
ACDEF". Also, the difference is detected in the form of 
"B is deleted", "A is inserted after D" or "G is changed to 
H". 
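The "ABCDEFG" / "ACDAEFH" example above can be worked through with a longest common subsequence (LCS) computation. The following is a minimal sketch only; the description does not name a particular matching algorithm, and the function names `lcs_table` and `edit_script` are illustrative assumptions.

```python
def lcs_table(a: str, b: str) -> list[list[int]]:
    """table[i][j] holds the LCS length of a[i:] and b[j:]."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) - 1, -1, -1):
        for j in range(len(b) - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    return table


def edit_script(a: str, b: str) -> list[tuple[str, str]]:
    """Return ("keep" | "delete" | "insert", element) steps turning a into b."""
    table = lcs_table(a, b)
    steps, i, j = [], 0, 0
    while i < len(a) or j < len(b):
        if i < len(a) and j < len(b) and a[i] == b[j]:
            steps.append(("keep", a[i]))
            i, j = i + 1, j + 1
        elif j == len(b) or (i < len(a) and table[i + 1][j] >= table[i][j + 1]):
            steps.append(("delete", a[i]))
            i += 1
        else:
            steps.append(("insert", b[j]))
            j += 1
    return steps


steps = edit_script("ABCDEFG", "ACDAEFH")
kept = "".join(x for op, x in steps if op == "keep")
deleted = [x for op, x in steps if op == "delete"]
inserted = [x for op, x in steps if op == "insert"]
print(kept, deleted, inserted)
```

In this sketch the change of "G" to "H" appears as a deletion of "G" paired with an insertion of "H".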

A conventional method for difference extraction is 
disclosed in JP-A-2-255964, in which comparison is 
made in terms of punctuation marks, lines, words and 
characters. In application of this method to structured 
documents, a character string representing a logical 
structure contained in the documents is compared in 
the same manner as other character strings are com- 
pared in the documents. 

Extraction of a difference in a structured document 
by the same means as in a normal document may be 
inappropriate to the document editor, however, since the 
result may be non-coincident with the logical structure 
of the document. An example will be explained below. 

(Prior art example 1) 

With reference to the structured documents shown 
in Figs. 3A and 3B, the case will be explained in which 
documents having non-coincident logical structures are 
erroneously matched with each other in the process of 
difference extraction, thereby leading to an extraction 
result inappropriate to the document editor. 

The structured documents in Figs. 3A and 3B are 
described by SGML (Standard Generalized Markup 
Language; ISO 8879), indicating that a character string 
sandwiched by marks, for example, (A) and (/A) called 
tags is associated with a logical structure A. In other 
words, the character string "TARO HEISEI" sandwiched 
between "(NAME)" and "(/NAME)" of Fig. 3A is associated
with the logical structure "NAME". HTML (Hypertext
Markup Language), which is used in the WWW (World
Wide Web), is an application of SGML and is applicable
to the present invention as well.

The mark representing this logical structure is also
called a tag. "(A)" and "(/A)" are thus more specifically
called a start tag and an end tag, respectively.

The result of extracting a difference character string 
between two structured documents in Figs. 3A and 3B 
by the conventional technique is shown in Figs. 4A and 
4B. 

Fig. 4B shows the result of extracting difference
character strings of the structured document in Fig. 3B
relative to the structured document in Fig. 3A. Fig. 4A
shows the result of extracting difference character
strings of the structured document in Fig. 3A relative to
the structured document in Fig. 3B.

As seen from Figs. 4A and 4B, "HEISEI" associated
with "(NAME)" and "HEISEI" associated with
"(TRANSMISSION DATE)" are not extracted as the
difference. This is due to the fact that the two occurrences
of "HEISEI" were coincident as character strings and were
erroneously matched with each other. This correspondence
of "HEISEI" between non-coincident logical structures is
obviously meaningless to the document editor.
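The erroneous match can be reproduced with a short sketch. The two strings below are assumptions pieced together from the tag contents quoted in this description (the figures themselves are not reproduced here), and Python's `difflib` merely stands in for a conventional character-by-character comparison; it is not the method of JP-A-2-255964.

```python
from difflib import SequenceMatcher

# Hypothetical fragments; only the tag names and contents come from the text.
before = "(NAME)TARO HEISEI(/NAME)"
after = ("(TRANSMISSION DATE)NOVEMBER 20, SIXTH YEAR OF HEISEI"
         "(/TRANSMISSION DATE)")


def common_runs(a: str, b: str, min_len: int = 4) -> list[str]:
    """Runs of characters that a flat comparison treats as coincident."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [a[m.a:m.a + m.size]
            for m in matcher.get_matching_blocks() if m.size >= min_len]


runs = common_runs(before, after)
# Tags are compared like any other characters, so the "HEISEI" inside NAME
# is paired with the "HEISEI" inside TRANSMISSION DATE.
print(runs)
```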

(Prior art example 2)

With reference to the structured documents shown
in Figs. 5A and 5B, the case will be explained in which
character strings are matched erroneously over different
document structures in the process of difference
extraction due to the insertion of a document structure,
thereby leading to an extraction result not proper to the
document editor. Fig. 5A shows a structured document
having Chapter 1, and Fig. 5B a structured document
with one other chapter inserted before Chapter 1.

Figs. 6A, 6B show an example of extracting a differ- 
ence character string between the two structured docu- 
ments of Figs. 5A, 5B. 

Figs. 6A, 6B show a case similar to Figs. 4A, 4B, in
which Fig. 6B shows the result of extracting a difference
character string of Fig. 5B relative to Fig. 5A. Fig. 6A, on
the other hand, shows the result of extracting a difference
character string of Fig. 5A relative to Fig. 5B.

As seen from Fig. 6A, Chapter 1 of Fig. 6A is
matched over Chapter 1 and Chapter 2 of Fig. 6B in
spite of the fact that Chapter 1 of Fig. 6A is identical to
Chapter 2 of Fig. 6B. This is another case inappropriate
to the document editor.

Dual appearance in Fig. 5B of the same character
string "STRUCTURED DOCUMENT", unlike in Fig. 5A,
leads to the erroneous decision in Fig. 6B that the first
"STRUCTURED DOCUMENT" is coincident while the
second "STRUCTURED DOCUMENT" is non-coincident,
so that the second "STRUCTURED DOCUMENT"
is extracted as a difference. This is true with each of
the subsequent cases of difference extraction.
(Prior art example 3)

With reference to the structured documents of Figs. 
7A, 7B, explanation will be made of the case in which 
the difference in marks representing the logical struc- 
ture of a document makes it impossible to match the 
contents of documents with each other in spite of the 
identical logical meaning of the documents, resulting in 
the extraction inappropriate to the document editor. 

In Figs. 7A, 7B, a tag "(FIRST ITEM)" is attached
only to the item that first appears, in spite of the fact that
its logical meaning remains the same as that of
"(ITEM)".

Figs. 8A, 8B show the case in which difference 
character strings between two structured documents of 
Figs. 7A and 7B are extracted by the conventional tech- 
nique. 

Figs. 8A, 8B represent a case similar to Figs. 4A, 
4B, in which Fig. 8B shows the result of extracting differ- 
ence character strings of Fig. 7B as compared with Fig. 
7A, while Fig. 8A shows the result of extracting differ- 
ence character strings of Fig. 7A as compared with Fig. 
7B. 

From Figs. 8A, 8B, it is seen that the "FIRST ITEM" tags
are matched with each other and the character strings
associated with them are compared with each other as
the contents thereof. The logical meaning of "FIRST
ITEM" and that of "ITEM" are the same for the document
editor, and therefore the contents of the tags are required
to be matched in priority over the tags.

In extracting the difference between structured doc- 
uments, comparison between them is required taking 
into consideration the logical meaning and the structure 
of the structured documents. This requirement is not 
met by the conventional method in which character 
strings indicating a logical structure are compared in 
similar fashion to other character strings in the docu- 
ment. 

SUMMARY OF THE INVENTION 

An object of the present invention is to provide a 
method and an apparatus for extracting a difference 
character string between structured documents in a 
manner suited to the linguistic sense of the document 
editor taking the logical meaning and structure of the
structured documents into consideration.

Another object of the present invention is to provide 
a method and an apparatus for managing the edition of 
a structured document for a document processing sys- 
tem capable of managing the edition on the basis of 
comparison and discrimination of the logical structures 
of structured documents. 

In order to achieve the above-mentioned objects,
according to one aspect of the invention, there is provided
a structured document difference extraction
method including memory means for storing structured
documents defined as information on the logical structure
of documents before and after edition such as deletion,
insertion or change, and a processor for extracting
a character string non-coincident between the structured
documents before and after edition as a difference,
comprising the steps of:

editing and storing a structured document in the
memory means;

parsing the logical structures of the structured
document before and after edition read from the memory
unit on the basis of a set comparison criterion;
and

extracting the difference between the structured
documents in such a manner as to satisfy the comparison
criterion in accordance with the result of
parsing of the structured documents.

The comparison criterion includes tags indicating
logical structures and types of comparison criterion
corresponding to the tags, with the contents thereof being
stored in a table.

The tags are defined to be ones of the following four
types of comparison criterion:

(1) Tags having the contents which are compared
only when the particular tags are coincident with
each other (identity tags)

(2) Tags having the contents the difference of which
is ignored at the time of comparison (ignoring tags)

(3) A set of tags identical to each other in logical
meaning (equivalence tags, such as "FIRST ITEM"
and "ITEM")

(4) A set of tags having the contents which are not
compared with each other (no-comparison tags)

Furthermore, a document tree representing the
structure of each structured document is produced by
the above-mentioned parsing method, and the difference
between the structured documents is extracted by
comparison between the nodes of the respective document
trees. In the case where given nodes are non-coincident
with each other, the difference is extracted
between the nodes by comparison between the characters
of the nodes.

In addition, in producing a document tree or hierarchy
representing each document structure by the
aforementioned parsing method, the allocation of the nodes
of the document trees is altered in accordance with the
comparison criterion described above.

According to another aspect of the invention, there
is provided a structured document difference extraction
apparatus comprising a memory means for storing
structured documents before and after edition including
deletion, insertion or change, and a processor for
extracting at least a non-coincident character string of
each structured document before and after edition as a
difference between the structured documents, wherein:

the processor includes means for editing the
structured documents and storing the result of edition in
the memory means, means for parsing the logical structure
of structured documents before and after edition
read from the memory means on the basis of a preset
comparison criterion, and means for extracting the
difference between the structured documents in such a
manner as to meet the comparison criterion in accordance
with the result of parsing of the structured documents.
The extraction means includes a table for storing
tags representing logical structures and types of criterion
for the tags.

The following four criterion types of tags are defined
beforehand for comparison:

(1) Tags having the contents which are compared
only when the particular tags are coincident with each
other

(2) Tags having the contents the difference of which
is ignored at the time of comparison

(3) A set of tags identical in logical meaning to each
other

(4) A set of tags having the contents which are not
compared with each other

Further, the structured document parsing means
produces a document tree representing the structure of
each document, and the structured document difference
extraction means extracts the difference between
the structured documents before and after edition by
comparing the respective document trees by node. In
the case where a given pair of nodes between a pair of
structured documents fail to be coincident with each
other, the difference is extracted by comparing the
particular nodes, this time, by character.

In addition, the structured document parsing
means, when producing a document tree representing a
document structure, alters the allocation of the nodes of
the document tree in accordance with the comparison
criterion.

With the solutions as described above, structured
documents are edited, the logical structure of the edited
structured documents is analyzed by the structured
document parsing means, a comparison criterion used
for extracting the difference corresponding to the logical
structure is set in advance, and a difference character
string between the structured documents before and
after edition is extracted in such a manner as to meet
the comparison criterion. A more relevant difference
conforming with the linguistic sense of the editor can
thus be automatically extracted in accordance with the
logical structure.

Also, the difference is extracted by node between
document trees, whereas the difference between
non-coincident nodes is extracted by character, so that an
erroneous extraction of the difference over different
structures can be eliminated.

BRIEF DESCRIPTION OF THE DRAWINGS 

Fig. 1 is a block diagram showing the configuration
of an embodiment of the present invention.

Fig. 2A is a diagram showing the processing steps 
according to an embodiment of the invention. 

Fig. 2B is a flowchart showing a detailed example of 
steps of producing a document tree shown in Fig. 2A. 

Figs. 3A, 3B are diagrams showing a first example 
of structured documents before and after edition,
respectively. 

Figs. 4A, 4B are diagrams showing the first exam- 
ple of the structured documents before and after differ- 
ence extraction, respectively according to the prior art. 

Figs. 5A, 5B are diagrams showing a second exam- 
ple of structured documents before and after edition, 
respectively. 

Figs. 6A, 6B are diagrams showing the second 
example of the structured documents before and after 
difference extraction, respectively according to the prior 
art. 

Figs. 7A, 7B are diagrams showing a third example 
of structured documents before and after edition, 
respectively. 

Figs. 8A, 8B are diagrams showing the third example
of the structured documents before and after difference
extraction, respectively according to the prior art
method.

Fig. 9 shows an example comparison criterion table 
for the first example of structured documents according 
to the present invention. 

Figs. 10A, 10B are diagrams showing document 
trees produced from the first example of structured doc- 
uments before and after edition shown in Figs. 3A, 3B 
on the basis of the comparison criterion table of Fig. 9. 

Fig. 10C is a flow diagram showing production pro- 
cedure for document tree of Fig. 10A. 

Figs. 11A, 11B are diagrams showing the first
example of the structured documents before and after 
difference extraction, respectively based on the com- 
parison criterion table of Fig. 9. 

Fig. 12 shows an example comparison criterion
table for the second example of the structured documents
shown in Figs. 5A, 5B.

Figs. 13A, 13B are diagrams showing document 
trees produced from the second example of the struc- 
tured documents before and after edition shown in Figs. 
5A, 5B, respectively on the basis of the comparison cri- 
terion table of Fig. 12. 

Figs. 14A, 14B are diagrams showing the second 
example of the structured documents of Figs. 5A, 5B 
before and after difference extraction, respectively 
based on the comparison criterion table of Fig. 12. 

Fig. 15 shows an example comparison criterion 
table for a third example of the structured documents 
shown in Figs. 7A, 7B. 

Figs. 16A, 16B are diagrams showing document 
trees produced from the third example of the structured 
documents before and after edition shown in Figs. 7A, 
7B, respectively on the basis of the comparison criterion 
table of Fig. 15. 

Figs. 17A, 17B are diagrams showing the third 
example of structured documents of Figs. 7A, 7B before
and after difference extraction, respectively based on
the comparison criterion table of Fig. 15.

Figs. 18A, 18B are diagrams showing a fourth
example of structured documents before and after
edition, respectively.

Fig. 19 shows an example comparison criterion
table for the fourth example of the structured documents
shown in Figs. 18A, 18B.

Figs. 20A, 20B are diagrams showing document
trees produced from the fourth example of the structured
documents before and after edition, respectively
shown in Figs. 18A, 18B on the basis of the comparison
criterion table of Fig. 19.

Figs. 21A, 21B are diagrams showing the fourth
example of structured documents shown in Figs. 18A, 18B
before and after difference extraction, respectively
based on the comparison criterion table of Fig. 19.

Fig. 22 is a flowchart showing another embodiment
of the invention.

Figs. 23A, 23B are diagrams showing an example
of documents to be compared according to
the embodiment of Fig. 22.

Figs. 24A, 24B are diagrams showing an example
result of comparison between the structured documents
of Figs. 23A, 23B, respectively.

Fig. 25 is a diagram showing an example structured
document representing the structured document
difference data.

Figs. 26A, 26B are diagrams showing an example
of structured documents displayed on the screen before
and after edition, respectively.

Fig. 27 is a diagram showing an example of
structured document difference data displayed on the
screen.

DESCRIPTION OF THE PREFERRED EMBODIMENTS

Embodiments of the invention will be described
below with reference to the accompanying drawings. 

Fig. 1 shows the configuration of an embodiment of 
the invention. 

In Fig. 1, a reference numeral 101 designates a
CPU, numeral 102 a terminal device including an
input/output device, a display device and a program
storage loading device 103A on which a processing program
storing medium such as a floppy disk or the like is
mounted, and numeral 103 a memory unit for storing
documents and/or a processing program, capable of
functioning as a program storage alternative to the
floppy disk. The CPU 101 has executably set therein a
document editing program 104 for editing documents, a
structured document parsing program 105 for converting
each structured document into a tree configuration,
a structured document difference extraction program
106 for extracting non-coincident portions of the structured
documents as a difference, and a comparison criterion
table 107 for storing comparison criteria for
extraction of difference character strings. These programs
can be supplied to the CPU 101 in a form stored
in the floppy disk in advance.

Each of the structured documents according to this 
embodiment assumes the form of an SGML document. 
SGML, as described above, is defined as a document 
description language set as an ISO world standard of 
marked structured documents. SGML documents have 
the logical structure thereof defined in advance by the 
document type definition (DTD). Nevertheless, it should 
be understood that the present embodiment is applica- 
ble also to the processing of structured documents hav- 
ing a function analogous to SGML. 

Specific processing steps according to the present 
embodiment will be described with reference to the 
flowcharts of Figs. 2A and 2B.

Step 201: 

Structured documents are edited by the document
editing program 104.

Step 202: 

The comparison criterion table 107 corresponding 
to the DTD of the SGML documents to be compared is 
read into the work area of the CPU 101.

In the absence of a comparison criterion table cor- 
responding to the DTD of the SGML documents, an 
appropriate table is prepared and entered in advance. 

This comparison criterion table classifies tags
under the following four criteria:

(1) Identity tag: It represents a tag whose contents,
i.e., the characters sandwiched between the start tag
and the end tag, are compared with each other
only when the tags themselves are coincident with
each other

(2) Ignoring tag: It represents a tag having the contents
the difference of which is ignored at the time
of comparison

(3) Equivalence tags: These represent a set of
apparently different tags having the same logical
meaning

(4) No-comparison tags: These represent a set of
tags the contents of which are not compared
with each other
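One possible in-memory shape for such a table is sketched below. This is an illustration under stated assumptions, not the layout of the comparison criterion table 107: the class name `CriterionTable`, its fields and its methods are invented for the example, and the no-comparison tag names "NOTE" and "COMMENT" are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CriterionTable:
    """Hypothetical in-memory form of a comparison criterion table."""
    identity: set[str] = field(default_factory=set)
    ignoring: set[str] = field(default_factory=set)
    # canonical name -> all apparently different names with that meaning
    equivalence: dict[str, set[str]] = field(default_factory=dict)
    # each set lists tags whose contents are not compared with each other
    no_comparison: list[set[str]] = field(default_factory=list)

    def canonical(self, tag: str) -> str:
        """Map an equivalence tag to its shared name; others pass through."""
        for name, members in self.equivalence.items():
            if tag in members:
                return name
        return tag

    def comparable(self, tag_a: str, tag_b: str) -> bool:
        """False when both tags belong to one no-comparison set."""
        return not any(tag_a in group and tag_b in group
                       for group in self.no_comparison)


table = CriterionTable(
    identity={"NAME", "TRANSMISSION DATE"},
    ignoring={"CHAPTER NUMBER"},
    equivalence={"ITEM": {"FIRST ITEM", "ITEM"}},
    no_comparison=[{"NOTE", "COMMENT"}],  # hypothetical tag names
)
print(table.canonical("FIRST ITEM"), table.comparable("NOTE", "COMMENT"))
```

Keeping the four criterion types in separate fields lets the parsing and extraction steps consult only the entries they need.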

Step 203: 

When the difference extraction program 106 is 
called in Fig. 2A, the structured documents are ana- 
lyzed by the structured document parsing program 105 
by reference to the comparison criterion table 107 
thereby to prepare document trees. The steps of the
parsing program for the structured documents are
shown in detail in Fig. 2B.

In the process, the elements allocated to each node
of the document tree are determined according to the
rules established as follows:

Rule 1: Allocate each tag to a node.

Rule 2: Allocate the character strings sandwiched
between a start tag and an end tag to a child
node of the start tag.

Rule 3: Allocate each end tag to a child node of the
start tag associated with the particular end
tag.

Rule 4: Allocate the character strings sandwiched
between identity tags to a single node
together with the start and end tags
thereof.

Rule 5: Don't allocate ignoring tags and the character
strings sandwiched between the ignoring
tags to any node.

Rule 6: Allocate equivalence tags to nodes by converting
the apparently different names
thereof into an identical tag name.
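Rules 1 to 6 can be sketched as a small tree builder. The sketch assumes the parenthesised tag notation used in this description, well-formed and properly nested tags, identity tags that contain only character strings, and no parentheses inside ordinary text; `Node`, `tokenize` and `build_tree` are illustrative names, not the procedure of Fig. 2B.

```python
import re
from dataclasses import dataclass, field

TAG = re.compile(r"\((/?)([^()]+)\)")  # "(NAME)" or "(/NAME)"


@dataclass
class Node:
    label: str
    atomic: bool = False  # True for identity-tag nodes (Rule 4)
    children: list["Node"] = field(default_factory=list)


def tokenize(doc: str):
    """Yield ("start" | "end" | "text", value) pairs in document order."""
    pos = 0
    for m in TAG.finditer(doc):
        text = doc[pos:m.start()].strip()
        if text:
            yield "text", text
        yield ("end" if m.group(1) else "start"), m.group(2).strip()
        pos = m.end()
    if doc[pos:].strip():
        yield "text", doc[pos:].strip()


def build_tree(doc, identity=frozenset(), ignoring=frozenset(), equivalence=None):
    """Build a document tree; equivalence maps a tag name to its shared name."""
    equivalence = equivalence or {}
    root = Node("ROOT")
    stack = [root]
    skipping = None  # ignoring tag whose contents are being dropped
    for kind, value in tokenize(doc):
        if kind != "text":
            value = equivalence.get(value, value)  # Rule 6
        if skipping is not None:  # Rule 5
            if kind == "end" and value == skipping:
                skipping = None
            continue
        top = stack[-1]
        if kind == "start":
            if value in ignoring:
                skipping = value
                continue
            node = Node(f"({value})", atomic=value in identity)  # Rule 1
            top.children.append(node)
            stack.append(node)
        elif kind == "text":
            if top.atomic:
                top.label += value  # Rule 4
            else:
                top.children.append(Node(value))  # Rule 2
        else:  # end tag
            stack.pop()
            if top.atomic:
                top.label += f"(/{value})"  # Rule 4
            else:
                top.children.append(Node(f"(/{value})"))  # Rule 3
    return root


# Hypothetical document; only the tag names are taken from this description.
doc = ("(TREATISE)(AUTHOR NAME)TARO HEISEI(/AUTHOR NAME)"
       "(CHAPTER)(CHAPTER NUMBER)CHAPTER 1(/CHAPTER NUMBER)"
       "STRUCTURED DOCUMENT(/CHAPTER)(/TREATISE)")
tree = build_tree(doc, identity={"AUTHOR NAME"}, ignoring={"CHAPTER NUMBER"})
treatise = tree.children[0]
print([child.label for child in treatise.children])
```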

Step 204: 

The document trees prepared by the above-mentioned
steps are compared by node with each other and
the difference is extracted by node. In the case where
the tags to be compared are no-comparison tags, the
particular nodes and underlying nodes (child nodes) are
not compared.

Step 205: 

The difference is extracted, this time, by character,
only for the nodes found non-coincident. For a node of
an identity tag, however, comparison by character is
made only when the leading character (string) constituting
a tag of the node is coincident. The ignoring tags
that were not compared at step 204 are compared at the
present step.
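Steps 204 and 205 can be sketched as a two-pass comparison. The sketch below assumes each document tree has already been flattened into a list of node labels, uses Python's `difflib` in place of whatever matching algorithm an implementation would choose, and omits both the no-comparison rule and the late comparison of ignoring tags; `tag_of`, `char_diff` and `diff_nodes` are illustrative names.

```python
from difflib import SequenceMatcher


def tag_of(label: str) -> str:
    """Leading tag of a label such as "(NAME)TARO(/NAME)", else ""."""
    return label.partition(")")[0] + ")" if label.startswith("(") else ""


def char_diff(old: str, new: str) -> list[tuple[str, str, str]]:
    """Step 205: character-level difference between two node labels."""
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    return [(op, old[i1:i2], new[j1:j2])
            for op, i1, i2, j1, j2 in matcher.get_opcodes() if op != "equal"]


def diff_nodes(old_nodes, new_nodes, identity_tags=frozenset()):
    """Step 204 by node, then step 205 by character for unmatched pairs."""
    result = []
    matcher = SequenceMatcher(None, old_nodes, new_nodes, autojunk=False)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            continue
        old, new = old_nodes[i1:i2], new_nodes[j1:j2]
        if op == "replace" and len(old) == len(new):
            for o, n in zip(old, new):
                is_identity = (tag_of(o) in identity_tags
                               or tag_of(n) in identity_tags)
                if is_identity and tag_of(o) != tag_of(n):
                    # Identity tags differ: no character comparison at all.
                    result += [("delete", o, ""), ("insert", "", n)]
                else:
                    result += char_diff(o, n)
        else:
            result += [("delete", o, "") for o in old]
            result += [("insert", "", n) for n in new]
    return result


# Hypothetical node lists; the tag contents are quoted from this description.
old_nodes = ["(NAME)TARO HEISEI(/NAME)", "HELLO"]
new_nodes = ["(TRANSMISSION DATE)NOVEMBER 20, SIXTH YEAR OF HEISEI"
             "(/TRANSMISSION DATE)", "HELLO"]
identity = {"(NAME)", "(TRANSMISSION DATE)"}
print(diff_nodes(old_nodes, new_nodes, identity))
print(diff_nodes(["CHAPTER 1"], ["CHAPTER 2"]))
```

With the identity guard in place, the whole "(NAME)" node and the whole "(TRANSMISSION DATE)" node are reported as differences rather than being compared character by character.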

Step 206: 

The difference extraction output of step 205 is displayed
on the display unit of the terminal device 102
(step 206A). At the same time, the same difference output
can be supplied to a difference data utilization
device in parallel to the display unit. The CPU 101 can
automatically execute such processes as updating and
revision of relevant parameters in accordance with the
difference output. These functions can be considered
as a review. Fig. 2B shows the process of parsing structured
documents in steps 301 to 311.

Processing Example 1 

A specific example of processing according to the
embodiment having an identity tag is described below
with reference to the example documents shown in
Figs. 3A and 3B.

Step 201:

The structured documents are edited by the document
editing program 104 (Fig. 1). The document of Fig.
3B is assumed to have been edited from that of Fig. 3A.

Step 202: 

The comparison criterion table 107 corresponding 
to the DTD of the SGML documents to be compared is 
read out to the CPU 101. 

In the absence of a corresponding comparison cri- 
terion table, an appropriate table is first produced and 
entered. 

A comparison criterion table as shown in Fig. 9, for 
example, is produced from Figs. 3A and 3B. Specifically,
"(NAME)" and "(TRANSMISSION DATE)" are defined 
as identity tags, which means that character strings are 
not matched unless the tags are coincident between the 
documents to be compared. 

Step 203: 

Once the difference extraction program 106 is 
called, the structured documents to be compared are 
analyzed by the structured document parsing program 
105 while referring to the comparison criterion table 
107, thereby producing corresponding document trees. 

By applying the rules described above with refer- 
ence to an embodiment, the document trees of Figs. 
10A, 10B are produced from the structured documents 
of Figs. 3A, 3B respectively by referring to the compari- 
son criterion table of Fig. 9. 

Structured documents 1001, 1002 in Figs. 10A, 
10B have identity tags and therefore the tags and con- 
tent characters thereof are allocated collectively to a 
single node according to Rule 4. The process of produc- 
ing document trees of Figs. 10A, 10B for difference 
extraction is shown as steps 401 to 406 in Fig. 10C. 

Step 204: 

The difference is extracted by node between the 
document trees. 

Since comparison is made by node, "(NAME)" and
"(TRANSMISSION DATE)", which are identity tags, are
not matched unless the particular tags and the character
strings of the contents thereof are both coincident
with each other. In this case, due to the non-coincidence
between the tags of nodes 1001 and 1002, both the tags
and the contents thereof are extracted as a difference.

Step 205: 

The difference between non-coincident nodes is 
extracted by character. Nodes having an identity tag,
however, are compared by character only in the case 
where the leading character string constituting each of 
the tags of the respective nodes is coincident. 
Step 206:

The resulting difference is displayed on the terminal 
device 102. 

An example result of difference extraction between 
the documents of Figs. 3A and 3B is shown in Figs. 11A, 11B.

Fig. 11B shows the result of extracting difference
character strings taken of the structured document of
Fig. 3B as compared with the structured document of
Fig. 3A. Fig. 11A, on the other hand, shows the result of
extracting difference character strings taken of the
structured document of Fig. 3A as compared with the
structured document of Fig. 3B.

In Fig. 11B, the tag marks "(NAME)" and
"(TRANSMISSION DATE)" of nodes 1001 and 1002 fail 
to coincide with each other, and therefore the character 
string "(TRANSMISSION DATE) NOVEMBER 20, 
SIXTH YEAR OF HEISEI (/TRANSMISSION DATE)" of 
node 1002 is extracted in its entirety as a difference. 
Also, since Fig. 3A contains no description of "ARE 
YOU FINE" in Fig. 3B, "ARE YOU FINE" is extracted as 
a difference. 

If the difference extraction is executed according to 
the above-mentioned steps, as long as a tag containing 
the characters the comparison of which is meaningless 
in the absence of tag coincidence is entered as an iden- 
tity tag, structured documents of non-coincident logical 
structures are not matched with each other. A more 
appropriate difference extraction result thus can be pre- 
sented to the editor. 

Processing Example 2 

The document examples of Figs. 5A, 5B will be 
explained as a second specific process according to the 
embodiment with reference to the case having both an 
identity tag and an ignoring tag and involving a struc- 
tural displacement. 

Step 201: 

Structured documents are edited by the document 
editing program 104. The document of Fig. 5B is 
assumed to have been edited from the document of Fig. 
5A. 

Step 202: 

The comparison criterion table 107 corresponding 
to the DTD of the SGML document to be compared is 
read at this step. 

In the absence of a corresponding comparison cri- 
terion table, an appropriate table is produced and 
entered. 

In the case of Figs. 5A, 5B, for example, a comparison
criterion table as shown in Fig. 12 is produced.
Specifically, "(AUTHOR NAME)" is defined as an identity
tag. In this case, as described above, the character
strings are compared with each other only when the
tags are coincident with each other. Also, "(CHAPTER
NUMBER)" is defined as an ignoring tag. In this case,
the difference in chapter number is ignored, since it
should have no effect on difference extraction.

Step 203: 

Once the difference extraction program 106 is
called, the SGML documents are analyzed by the structured
document parsing program 105, and corresponding
document trees are produced while referring to the
comparison criterion table 107.

By application of the rules explained with reference
to an embodiment above, the document trees of Figs.
13A, 13B are produced by referring to the comparison
criterion table of Fig. 12 from the documents of Figs. 5A,
5B. "(CHAPTER NUMBER)", being an ignoring tag,
is not allocated as a node according to Rule 5 above.



Step 204:

The difference between document trees is
extracted by node.

The ignoring tags, which are not present as a node,
are not compared and have no effect on the whole process
of difference extraction.

Step 205:

The difference between non-coincident nodes is
extracted by character string. The ignoring tags and the
contents thereof that were not compared at step 204 are
also compared at this step.

Step 206:

The resulting difference is displayed on the terminal
device 102.

An example result of difference extraction between
the documents of Figs. 5A and 5B is shown in Figs.
14A, 14B. Fig. 14B shows the result of extracting a
difference character string taken of the structured document
of Fig. 5B as compared with the structured
document of Fig. 5A. Fig. 14A, on the other hand, is a
diagram showing the result of extracting a difference
character string taken of the structured document of Fig.
5A as compared with the structured document of Fig.
5B.

Explanation will be made about the case in which
the difference is taken of the structured document of
Fig. 5B as compared with the structured document of
Fig. 5A and the result of extracting the difference character
string is obtained as shown in Fig. 14B.

In the difference extraction by node between document
trees at step 204, "(TREATISE)", "(/TREATISE)",
"(AUTHOR NAME) TARO HEISEI (/AUTHOR NAME)",
and "(CHAPTER) STRUCTURED DOCUMENT DIFFERENCE
EXTRACTION METHOD (/CHAPTER)" are
determined to be coincident in Figs. 13A, 13B, so that
they are displayed as coincident parts in Fig. 14B.

Since step 204 decides that "(CHAPTER) STRUCTURED DOCUMENT
DIFFERENCE EXTRACTION METHOD (/CHAPTER)" is coincident, step
205 decides that "(CHAPTER NUMBER)" and "(/CHAPTER NUMBER)"
associated with the coincident part are also coincident. On the
other hand, "CHAPTER 2", which is not coincident with
"CHAPTER 1", is extracted as a difference and displayed as
shown in Fig. 14B.

Also, due to the decision at step 204 that "(CHAPTER) WHAT IS
STRUCTURED DOCUMENT? (/CHAPTER)" in Fig. 13B is not coincident,
this "(CHAPTER) WHAT IS STRUCTURED DOCUMENT? (/CHAPTER)" and
"(CHAPTER NUMBER) CHAPTER 1 (/CHAPTER NUMBER)" associated with
the particular non-coincident part are extracted as a
difference and displayed as shown in Fig. 14B.

In the difference extraction according to the steps described
above, document trees are compared by node, i.e., by structure,
and therefore nodes 1301 and 1302, for example, are matched in
this process. As a result, an erroneous matching over different
structures of the kind shown in Fig. 6 does not occur. Since
comparison of document trees by node includes no comparison
between ignoring tags, any difference in the contents of the
ignoring tags has no effect on the difference extraction
process as a whole.
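The two-stage comparison of steps 204 and 205 can be sketched as follows. This is a minimal illustration under stated assumptions: the description does not name a matching algorithm, so Python's standard `difflib.SequenceMatcher` is used as a stand-in, each node is reduced to a hypothetical `(tag, text)` pair, and the function names are invented for the example.

```python
from difflib import SequenceMatcher


def node_diff(old_nodes, new_nodes):
    """Step 204 analogue: mark each node of the new document as
    coincident (True) or non-coincident (False), comparing whole nodes."""
    sm = SequenceMatcher(a=old_nodes, b=new_nodes, autojunk=False)
    labelled = []
    for op, _i1, _i2, j1, j2 in sm.get_opcodes():
        for node in new_nodes[j1:j2]:
            labelled.append((op == "equal", node))
    return labelled


def char_diff(old_text, new_text):
    """Step 205 analogue: substrings of new_text that have no
    counterpart in old_text, compared character by character."""
    sm = SequenceMatcher(a=old_text, b=new_text, autojunk=False)
    return [
        new_text[j1:j2]
        for op, _i1, _i2, j1, j2 in sm.get_opcodes()
        if op in ("insert", "replace")
    ]


# Nodes of Figs. 5A and 5B with the ignoring tag already left out.
old_nodes = [
    ("AUTHOR NAME", "TARO HEISEI"),
    ("CHAPTER", "STRUCTURED DOCUMENT DIFFERENCE EXTRACTION METHOD"),
]
new_nodes = [
    ("AUTHOR NAME", "TARO HEISEI"),
    ("CHAPTER", "WHAT IS STRUCTURED DOCUMENT ?"),
    ("CHAPTER", "STRUCTURED DOCUMENT DIFFERENCE EXTRACTION METHOD"),
]
labelled = node_diff(old_nodes, new_nodes)
```

Because whole nodes are matched first, the inserted chapter is reported as one non-coincident node, and character comparison is applied only where a node has failed to match.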

Processing Example 3

A third specific processing example according to an embodiment
having an identity tag and an equivalence tag will be explained
with reference to the example documents of Figs. 7A, 7B.

Step 201:

Structured documents are edited by the document editing program
104. It is assumed that the document of Fig. 7B is edited from
the document of Fig. 7A.

Step 202:

A comparison criterion table 107 corresponding to the DTD of
the SGML documents to be compared is read at this step.

In the absence of a corresponding comparison criterion table,
an appropriate table is produced and entered.

In the case of Figs. 7A, 7B, a comparison criterion table as
shown in Fig. 15 is produced. In other words, "(AUTHOR NAME)"
is defined as an identity tag. In this case, as long as given
tags fail to coincide with each other, the character strings
associated with them are not matched. Also, "(ITEM)" and
"(FIRST ITEM)" are defined as equivalence tags. In the last
case, "(ITEM)" and "(FIRST ITEM)" are considered to have the
same logical structure.
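A comparison criterion table such as that of Fig. 15 can be held as plain data. The sketch below is one possible representation, not the format actually used for table 107: the `Criterion` enumeration, the tuple layout and the helper `tags_of` are assumptions made for illustration.

```python
from enum import Enum


class Criterion(Enum):
    IDENTITY = "identity tag"
    IGNORING = "ignoring tag"
    EQUIVALENCE = "equivalence tag"
    NO_COMPARISON = "no-comparison tag"


# Rows of Fig. 15 as (item number, tags, type of comparison criterion).
FIG15_TABLE = [
    (1, ("AUTHOR NAME",), Criterion.IDENTITY),
    (2, ("ITEM", "FIRST ITEM"), Criterion.EQUIVALENCE),
]


def tags_of(table, criterion):
    """Return the set of tag names registered under one criterion type."""
    return {tag for _no, tags, kind in table if kind is criterion for tag in tags}
```

A parser could then ask, for example, `tags_of(FIG15_TABLE, Criterion.EQUIVALENCE)` before deciding how to allocate nodes.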



Step 203: 

Once the difference extraction program 106 is 
called, the SGML document is analyzed by the struc- 
tured document parsing program 105 and document 
trees are produced while referring to the comparison 
criterion table 107. 

Application of the rules described above with refer- 
ence to an embodiment permits the document trees of 
Figs. 16A, 16B to be produced from the documents of 
Figs. 7A, 7B respectively by reference to the compari- 
son criterion table of Fig. 15. 

Nodes 1601, 1602, 1603 in Figs. 16A, 16B are converted into the
same tag name under Rule 6.
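The conversion under Rule 6 amounts to mapping every tag of an equivalence set to a single representative name before nodes are compared. A minimal sketch follows; `make_canonical` is a hypothetical helper, and choosing the first listed tag as the representative is an assumption, since the description only requires that the names become the same.

```python
def make_canonical(equivalence_sets):
    """Return a function that maps each tag to the representative of its
    equivalence set, or to itself when it belongs to no set."""
    mapping = {}
    for group in equivalence_sets:
        representative = group[0]  # assumed choice: first tag listed
        for tag in group:
            mapping[tag] = representative
    return lambda tag: mapping.get(tag, tag)


# Equivalence set taken from the comparison criterion table of Fig. 15.
canonical = make_canonical([("ITEM", "FIRST ITEM")])
```

With this mapping applied while the trees are built, a node tagged "FIRST ITEM" in one document and a node tagged "ITEM" in the other carry the same name when they are compared at step 204.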

Step 204: 

The difference between the document trees is 
extracted by node. The equivalence tags are given the 
same tag name and therefore are not extracted as a dif- 
ference. 

Step 205: 

Only those tags which are found non-coincident 
with each other are extracted, this time, by character. 

Step 206: 

The resulting difference is displayed on the terminal 
device 102. 

An example of extracting the difference between 
the documents of Figs. 7A, 7B is shown in Figs. 17A, 
17B. 

Fig. 17B shows the result of extracting difference 
character strings taken of the structured document of 
Fig. 7B as compared with the structured document of 
Fig. 7A, and Fig. 17A is the result of extracting differ- 
ence character strings taken of the structured document 
of Fig. 7A as compared with the structured document of 
Fig. 7B. 

Explanation will be made about the case in which 
the difference is taken of the structured document of 
Fig. 7B as compared with the structured document of 
Fig. 7A and the extraction result of Fig. 17B is obtained. 

In extracting the difference between the document trees by node
at step 204, "(TREATISE)", "(/TREATISE)", "(AUTHOR NAME) TARO
HEISEI (/AUTHOR NAME)", and "(ITEM) STRUCTURED DOCUMENT
DIFFERENCE EXTRACTION METHOD (/ITEM)" are determined to be
coincident in Figs. 16A, 16B, and are displayed as coincident
parts in Fig. 17B.

Next, due to the decision at step 204 that "(ITEM) 
WHAT IS STRUCTURED DOCUMENT? (/ITEM)" is 
non-coincident, step 205 extracts the difference of the 
non-coincident part by character, so that "(ITEM) WHAT 
IS STRUCTURED DOCUMENT ? (/ITEM)" is extracted 
as a difference and displayed as shown in Fig. 17B. 

Upon extraction of the difference according to the steps
described above, the documents having the same logical
structure are seen to be matched with each other despite the
difference in tag name.

Processing Example 4

A fourth specific processing example according to an embodiment
will be explained with reference to the documents of Figs. 18A,
18B having a no-comparison tag.

Step 201:

A structured document is edited by the document editing program
104. The document of Fig. 18B is assumed to be edited from the
document of Fig. 18A.

Step 202:

A comparison criterion table 107 is read in which corresponds
to the DTD of the SGML document to be compared.

In the absence of a corresponding comparison criterion table,
an appropriate table is produced and entered.

In the case of Figs. 18A, 18B, for example, a comparison
criterion table as shown in Fig. 19 is produced. In other
words, "(SENDER)" and "(RECEIVER)" are defined as no-comparison
tags. In this case, "(SENDER)" and "(RECEIVER)" are not
compared in contents.

Step 203: 

Once the difference extraction program 106 is called, the SGML
document is analyzed by the structured document parsing program
105 and a document tree is produced while referring to the
comparison criterion table 107.

By applying the rules described above with reference to an
embodiment, the document trees of Figs. 20A, 20B are completed
from the documents of Figs. 18A, 18B by referring to the
comparison criterion table of Fig. 19.

Step 204: 

The difference between document trees is extracted by node.
"(SENDER)" and "(RECEIVER)" have tags of no-comparison type,
and therefore underlying nodes, that is, "(ORGANIZATION)" and
"(NAME)" providing child nodes, are not compared.

Step 205:

The difference between only those nodes which are
non-coincident with each other is extracted, this time, by
character.



Step 206: 

The resulting difference is displayed on the terminal 
device 102. 

An example of extracting the difference between the documents
of Figs. 18A, 18B is shown in Figs. 21A, 21B.

Fig. 21B shows the result of extracting the difference
character string taken of the structured document of Fig. 18B
as compared with the structured document of Fig. 18A, and
Fig. 21A the result of extracting the difference character
string taken of the structured document of Fig. 18A as compared
with the structured document of Fig. 18B.

Explanation will be made about the case in which the difference
is taken of the structured document of Fig. 18B as compared
with the structured document of Fig. 18A thereby to obtain the
result of extracting the difference character string shown in
Fig. 21B.

In extracting the difference between the document trees by node
at step 204, as shown in Figs. 18A, 18B, "(MEMO)", "(/MEMO)",
"(TEXT)" and "(/TEXT)" are determined to be coincident with
each other, while "(RECEIVER)", "(/RECEIVER)" and the contents
thereof including "(ORGANIZATION) OO BANK (/ORGANIZATION)" and
"(NAME) TARO HEISEI (/NAME)" are determined to be a difference,
since "(SENDER)" and "(RECEIVER)" are no-comparison tags.
"HELLO. ARE YOU FINE?" is determined to be non-coincident.

Due to the non-coincidence decision on "HELLO. ARE YOU FINE?"
at step 204, step 205 extracts the difference by character for
the non-coincident part, so that "ARE YOU FINE?" is extracted
as a difference.

As a consequence, the document as shown in Fig. 21B is
displayed.

In the difference extraction following the steps 
described above, once tags with the contents thereof 
not compared are entered as no-comparison tags, 
underlying nodes (child nodes) are not compared, and 
therefore the organizations and the names contained in 
"(SENDER)" and "(RECEIVER)" are not matched with 
each other, thereby making it possible to present a more 
appropriate result of difference extraction to the editor. 
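The behaviour of a no-comparison tag can be sketched as below. This is an illustration under several assumptions rather than the extraction program 106: nodes are hypothetical `(tag, text, children)` tuples, sibling nodes are paired by position for brevity, and `difflib.SequenceMatcher` stands in for the unspecified character comparison.

```python
from difflib import SequenceMatcher


def flatten(node):
    """Serialize a (tag, text, children) node back into tagged text."""
    tag, text, children = node
    inner = " ".join([text] + [flatten(c) for c in children]).strip()
    return f"<{tag}> {inner} </{tag}>"


def extract(old, new, no_comparison):
    """Collect differences of `new` against `old`; below a no-comparison
    tag nothing is matched, the whole subtree is reported instead."""
    diffs = []
    for o, n in zip(old[2], new[2]):  # positional pairing: a simplification
        if o[0] in no_comparison or n[0] in no_comparison:
            if flatten(o) != flatten(n):
                diffs.append(flatten(n))
            continue  # child nodes are never compared
        if o[0] != n[0]:
            diffs.append(flatten(n))
            continue
        sm = SequenceMatcher(a=o[1], b=n[1], autojunk=False)
        diffs += [
            n[1][j1:j2].strip()
            for op, _i1, _i2, j1, j2 in sm.get_opcodes()
            if op in ("insert", "replace")
        ]
        diffs += extract(o, n, no_comparison)
    return diffs


# The documents of Figs. 18A and 18B.
FIG_18A = ("MEMO", "", [
    ("SENDER", "", [("ORGANIZATION", "OO BANK", []),
                    ("NAME", "TARO HEISEI", [])]),
    ("TEXT", "HELLO.", []),
])
FIG_18B = ("MEMO", "", [
    ("RECEIVER", "", [("ORGANIZATION", "XX BANK", []),
                      ("NAME", "TARO SHOWA", [])]),
    ("TEXT", "HELLO. ARE YOU FINE ?", []),
])
result = extract(FIG_18A, FIG_18B, {"SENDER", "RECEIVER"})
```

Here "TARO" in the sender's name is never matched against "TARO" in the receiver's name; the receiver block is reported as a whole, while the text is still compared by character.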

Another embodiment of the invention is shown in Fig. 22. The
difference information which is extracted as a change between
structured documents before and after edition using the scheme
as disclosed in the above-mentioned embodiments has the
following features different from comparison between
non-structured documents:

(1) the change of the structure per se and the 
change of character strings in the structure are 
involved; and 

(2) the difference information has a logical struc- 
ture. This will be described with reference to struc- 
tured documents shown in Figs. 23A and 23B. 
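One way to picture difference information that carries both features is a record holding the kind of change together with the path of the structure in which it occurred. The sketch below uses the tag names of Figs. 23A and 23B; the `DiffItem` class, its field names and the `parent_of` helper are hypothetical and are not the output format of the embodiment.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class DiffItem:
    path: Tuple[str, ...]   # structure in which the change occurred
    kind: str               # "string" or "structure"
    old: Optional[str]      # content before edition (None if inserted)
    new: Optional[str]      # content after edition (None if deleted)


# Changes between structured document a (Fig. 23A) and a' (Fig. 23B).
items = [
    DiffItem(("MEMO", "TRANSMITTER", "NAME"), "string",
             "TARO HEISEI", "JIRO SHOWA"),
    DiffItem(("MEMO", "TRANSMITTER", "ORGANIZATION"), "structure",
             None, "ABC COMPANY"),
    DiffItem(("MEMO", "TEXT"), "string",
             "HELLO.", "HELLO. ARE YOU FINE ?"),
]


def parent_of(item):
    """Name of the structure that directly contains the changed one."""
    return item.path[-2] if len(item.path) > 1 else None
```

Keeping the path alongside each change is what lets a display distinguish an inserted structure from an altered character string inside an unchanged structure.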

Fig. 24A shows an example result of comparing a structured
document a before edition with a structured document a' after
edition shown in Figs. 23A and 23B. Item number 1 (601) and
item number 3 (603) represent an example in which the character
strings in a structure are altered without altering the
document structures including "(NAME)" and "(TEXT)". Item
number 2 (602) shows an example in which the structure of
"(ORGANIZATION)" is newly inserted.

Now, an example will be explained in which the difference
information between structured documents has a logical
structure. For example, item number 1 (601) and item number 3
(603) represent an alteration in character string. If the
difference information is to be expressed by specifying a
structure, for example, to the effect that the character string
alteration is one occurring in the structure of "(NAME)" and
"(TEXT)" respectively, then the difference data is required to
have structural information. Also, item number 2 (602) has
structural information that the inserted "(ORGANIZATION)",
which lies within the framework of the logical structure
"(SENDER)", is a child structure of "(SENDER)".

According to the prior art method, however, these 
characteristics of the difference data of structured docu- 
ments could not be displayed effectively. According to 
the prior art method, even if an alteration is one of infor- 
mation relating to the logical structure of a document, it 
is displayed by altering the display attribute of the char- 
acter indicating the structure without discriminating it 
from an alteration in the character string. The resulting 
problem is that it is difficult for the user to determine 
whether the structure or the content of the structure is 
altered. This problem is described with reference to a 
specific example. Fig. 24B shows an example display of 
difference data according to a comparative example of 
JP-A-7-200370. In this display method, the structural 
information is ignored without discriminating the altera- 
tion of a structure from that of a character string in the 
structure. Consequently, the actual alteration that is 
executed cannot be easily understood by the user who 
edits the structured document by means of a document 
editing software or the like. Also, in the case where the 
document editing software or the like uses a dedicated 
display program by expressing the structural information 
in a tree for displaying a structured document, a sepa- 
rate display program is required for displaying the differ- 
ence data such as shown in Figs. 24A, 24B, thereby 
inconveniently complicating the program. 

The embodiment of Fig. 22, as compared with the embodiment of
Fig. 1 in which the altered parts between structured documents
are extracted on the basis of logical structure information, is
different in that step 507 is added for displaying and storing
(editing) the difference information from a structured
difference information output step 505. Steps 501 to 506,
therefore, are substantially similar to steps 201 to 206 in
Fig. 1.

Step 507 displays the resulting difference on the terminal
device 102 according to a display/preserve program 110, and
stores the structured difference data in a secondary memory
unit 103. Since the difference data as illustrated in Fig. 25
is output in SGML form, the difference data can be displayed
directly using an editor or a viewer exclusive to SGML.
Figs. 26A and 26B show an example structured document displayed
on a dedicated SGML editor, and Fig. 27 an example display of
the difference data. In Figs. 26A, 26B, numeral 2301 designates
a window for displaying the structure, and numeral 2302 a
window for displaying the character strings in the structure.
Fig. 27 shows an example window displaying the difference data
of Fig. 25 in structured form. In the process, an alteration of
a structure is displayed by altering the color or type of the
mark representing the structure, by defining the altered part
by a solid line or by otherwise discriminating the altered
part. An altered part of a character string is also displayed
in discrimination from other character strings in similar
fashion. These discriminated displays may be highlighted.

With the foregoing steps, the difference data can be directly
displayed in structured form by incorporating this scheme in
the SGML document edition software as a document comparison
function. By discriminating an alteration in a structure from
that of a character string in a structure, for example, the
actual alteration can be easily understood by the user editing
the structured document by means of the document edition
software or the like. Also, even in the case where the document
edition software or the like uses a dedicated program for
indicating structural information by a tree when displaying a
structured document, an altered part can be displayed without
any independent display program. Similarly to the embodiment of
Fig. 2A, the structured difference data may be used to update
and/or revise structured documents to be edited in the step 507
or after completion of editing using known document (update)
processing programs.

It will thus be understood from the foregoing description that
according to the present invention, a comparison criterion
corresponding to a logical structure of a structured document
is defined, and the difference of a structured document to be
compared is extracted in such a manner as to meet the
comparison criterion, whereby a difference conforming with the
sense of the editor is extracted in accordance with the meaning
of the logical structure. Also, the difference between document
trees representing structures is extracted by node, and any
difference between the non-coincident nodes of the documents to
be compared is extracted by character. Consequently, a
difference over different structures, if any, is not extracted,
with the result that the editor can grasp the difference
suitable for the particular logical structure, thereby
improving the efficiency of editing a structured document. The
present invention is effectively applicable to automatic
updating of documents likely to be revised including various
legal documents and operation manuals described in SGML or the
like language. Further, the efficient edition according to the
invention is effective for managing plates of documents which
are required to be updated frequently.

Claims 

1. A structured document difference extraction method in which
structured documents before and after edition including
deletion, insertion or change are stored in a memory unit, and
a character string non-coincident between the structured
documents before and after said edition is extracted as a
difference by a processor, said method comprising the steps of:

editing and storing structured documents before and after
edition in said memory unit (201);

parsing a logical structure of each of the structured documents
before and after edition read from said memory unit on the
basis of a comparison criterion set for the logical structure
of each of said structured documents before and after edition
(203); and

extracting a difference between the structured documents which
can satisfy said comparison criterion with respect to the
result of said parsing (204, 205).

2. A structured document difference extraction method according
to Claim 1, wherein said comparison criterion includes a table
having at least a tag representing a logical structure and at
least a type of comparison criterion for said tag.

3. A structured document difference extraction method according
to Claim 2, wherein the tags are defined to have one of at
least four types of comparison criterion as:

tags having contents which are compared only when the
particular tags are coincident with each other,

tags having contents the difference of which is ignored at the
time of comparison,

a set of tags having the same logical meaning, and

a set of tags having contents which are not compared with each
other.

4. A structured document difference extraction method according
to any one of Claims 1 to 3, further comprising the steps of:

producing a document tree representing a document structure by
parsing each of said structured documents (203);

extracting the difference between the document trees by node as
the difference between the structured documents (204); and

extracting the difference by character between non-coincident
nodes (205).

5. A structured document difference extraction 
method according to Claim 4, further comprising 
the step of: 

altering the method of allocating the nodes of a 
document tree representing the document 
structure in accordance with said comparison 
criterion at the time of producing said docu- 
ment tree by parsing said structured document. 

6. A structured document difference extraction appa- 
ratus comprising a memory unit for storing struc- 
tured documents before and after executing edition 
including deletion, insertion or change, and a proc- 
essor for extracting a non-coincident character 
string between the two structured documents 
before and after edition as a difference, wherein 
said processor includes: 

means (104) for editing and storing the 
structured documents in said memory unit; 

means (105) for parsing a logical structure of 
the structured documents before and after edition 
read from said memory unit on the basis of a com- 
parison criterion set for the logical structure of each 
structured document before and after edition; and 

means (106) for extracting a difference 
between the structured documents so as to satisfy 
said comparison criterion in accordance with the 
result of parsing of the structured documents. 

7. A structured document difference extraction appa- 
ratus according to Claim 6, wherein said compari- 
son criterion assumes the form of a table including 
at least a tag representing the logical structure and 
at least a type of comparison criterion for said tag. 

8. A structured document difference extraction appa- 
ratus according to Claim 7, wherein the tags are 
defined to have one of at least four types of compar- 
ison criterion as 

(a) tags having contents which are compared 
only when the particular tags are coincident 
with each other 

(b) tags having contents the difference of which 
is ignored at the time of comparison 

(c) a set of tags having the same logical mean- 
ing 

(d) a set of tags having contents which are not 
compared with each other. 

9. A structured document difference extraction appa- 
ratus according to any one of Claims 6 to 8, 
wherein: 

said structured document parsing means produces at least a
document tree representing the document structure, and said
structured document difference extraction means extracts the
difference between the document trees by node as the difference
between the structured documents, the difference being
extracted by character between non-coincident nodes.

10. A structured document difference extraction appa- 
ratus according to Claim 9, wherein: 

said structured document parsing means 
alters the allocation of the nodes of a document tree 
representing the document structure in accordance 
with said comparison criterion at the time of produc- 
ing the document tree. 

11. A processor-readable medium storing program 
codes for allowing a computer comprising a mem- 
ory unit and a processor to extract a non-coincident 
character string between structured documents 
before and after edition, comprising: 

first program code means (104) for causing 
structured documents to be edited and the 
structured documents before and after edition 
to be stored in said memory unit; 
second program code means (105) for causing
a logical structure of each of said structured 
documents before and after edition read from 
said memory unit to be analyzed on the basis 
of a comparison criterion preset for the logical 
structure of each of said structured documents 
before and after edition; and 
third program code means (106) for causing 
the difference between the structured docu- 
ments before and after edition to be extracted 
which can satisfy a preset comparison criterion 
with respect to the result of parsing of said 
structured documents. 

12. A medium according to Claim 11, further compris- 
ing: 

fourth program code means (107) for causing a 
table including marks representing the logical 
structures of the structured documents and the 
types of comparison criterion for said marks to 
be prepared as said preset comparison crite- 
rion. 

13. A medium according to Claim 12, wherein:

said marks defined by different types of com- 
parison criterion of which a relation is stored in said 
comparison criterion table include 

identity marks having the contents which are 
compared only when the particular marks are coin- 
cident with each other, 

ignoring marks having the contents the dif- 
ference of which is ignored at the time of compari- 
son, 

equivalence marks representing a set of marks having the same
logical meaning, and

no-comparison marks representing a set of marks having the
contents which are not compared with each other.

14. A medium according to Claim 12, wherein:

said second program code means includes a program code section
(203) for causing said computer to produce a document tree
representing the structure of each of the structured documents
before and after edition at the time of parsing said structured
documents, and

said third program code means includes a program code section
(204, 205) for causing said computer to extract the difference
between said document trees by node as the difference between
said structured documents before and after edition and also to
extract the difference by character between non-coincident
nodes.

15. A medium according to Claim 14, wherein:

said second program code means for parsing of structured
documents includes a program code section for causing the
computer to alter the node configuration of said document tree
representing the document structure in accordance with said
comparison criterion at the time of producing said document
tree.

16. A medium according to Claim 11, further comprising a
program code section (206, 506) for causing the computer to
apply to a utilization means the difference between the
structured documents extracted on the basis of the execution of
said third program code means.

17. A structured document difference extraction appa- 
ratus according to Claim 6, further comprising: 

structured document display means for outputting the result of
a difference extracted by said structured document difference
extraction means, as difference information to display the
difference result on the basis of the structured difference
information

document update means for updating/revising 
structured documents to be updated/revised on 
the basis of said structured difference informa- 
tion produced from said structured document 
FIG. 2A

FIG. 3A



<MEMO> 

<NAME> TARO HEISEI </ NAME> 

<TEXT> 
HELLO. 

</ TEXT> 
</ MEMO> 



FIG. 3B 



<MEMO> 

<TRANSMISSION DATE> NOVEMBER 20, SIXTH YEAR OF HEISEI

</ TRANSMISSION DATE> 

<TEXT> 

HELLO. ARE YOU FINE ? 
</ TEXT> 
</ MEMO> 
FIG. 4A
PRIOR ART 



<MEMO> 

<NAME> TARO HEISEI </NAME> 

<TEXT> 
HELLO. 

</ TEXT> 
</ MEMO> 



FIG. 4B 
PRIOR ART 



<MEMO> 

<TRANSMISSION DATE> NOVEMBER 20, SIXTH YEAR OF HEISEI

</ TRANSMISSION DATE> 

<TEXT> 

HELLO. ARE YOU FINE ?
</ TEXT> 
</MEMO> 



UNDERLINED PARTS : DIFFERENCE CHARACTER STRINGS 
FIG. 5A



<TREATISE> 

<AUTHOR NAME> TARO HEISEI </ AUTHOR NAME> 
<CHAPTER> 

<CHAPTER NUMBER> CHAPTER 1 </ CHAPTER NUMBER> 
STRUCTURED DOCUMENT DIFFERENCE EXTRACTION METHOD 
</ CHAPTER> 
</TREATISE> 



FIG. 5B 



<TREATISE> 

<AUTHOR NAME> TARO HEISEI </ AUTHOR NAME> 
<CHAPTER> 

<CHAPTER NUMBER> CHAPTER 1 </ CHAPTER NUMBER> 
WHAT IS STRUCTURED DOCUMENT ? 
</ CHAPTER> 
<CHAPTER> 

<CHAPTER NUMBER> CHAPTER 2 </ CHAPTER NUMBER> 
STRUCTURED DOCUMENT DIFFERENCE EXTRACTION METHOD 
</ CHAPTER> 
</ TREATISE> 
FIG. 6A
PRIOR ART 



<TREATISE> 

<AUTHOR NAME> TARO HEISEI </ AUTHOR NAME> 
<CHAPTER> 

<CHAPTER NUMBER> CHAPTER 1 </ CHAPTER NUMBER> 
STRUCTURED DOCUMENT DIFFERENCE EXTRACTION METHOD 
</ CHAPTER> 
</TREATISE> 



FIG. 6B 
PRIOR ART 



<TREATISE> 

<AUTHOR NAME> TARO HEISEI </ AUTHOR NAME> 
<CHAPTER> 

<CHAPTER NUMBER> CHAPTER 1 </ CHAPTER NUMBER> 
WHAT IS STRUCTURED DOCUMENT ?
</CHAPTER> 
<CHAPTER> 

<CHAPTER NUMBER> CHAPTER 2 </ CHAPTER NUMBER>
STRUCTURED DOCUMENT DIFFERENCE EXTRACTION METHOD 
</ CHAPTER> 
</TREATISE> 



UNDERLINED PARTS : DIFFERENCE CHARACTER STRINGS 
FIG. 7A



<TREATISE> 

<AUTHOR NAME> TARO HEISEI </ AUTHOR NAME> 
<FIRST ITEM> 

STRUCTURED DOCUMENT DIFFERENCE EXTRACTION METHOD 
</ FIRST ITEM> 
</TREATISE> 



FIG. 7B 



<TREATISE> 

<AUTHOR NAME> TARO HEISEI </ AUTHOR NAME>
<FIRST ITEM> 

WHAT IS STRUCTURED DOCUMENT ? 
</ FIRST ITEM> 
<ITEM> 

STRUCTURED DOCUMENT DIFFERENCE EXTRACTION METHOD 
</ ITEM> 
</TREATISE> 
FIG. 8A PRIOR ART



<TREATISE> 

<AUTHOR NAME> TARO HEISEI </ AUTHOR NAME>
<FIRST ITEM> 

STRUCTURED DOCUMENT DIFFERENCE EXTRACTION METHOD 
</ FIRST ITEM> 
</TREATISE> 



FIG. 8B PRIOR ART 



<TREATISE> 

<AUTHOR NAME> TARO HEISEI </ AUTHOR NAME> 
<FIRST ITEM> 
WHAT IS STRUCTURED DOCUMENT ? 
</ FIRST ITEM> 
<ITEM> 

STRUCTURED DOCUMENT DIFFERENCE EXTRACTION METHOD 
</ITEM> 
</TREATISE> 



UNDERLINED PARTS : DIFFERENCE CHARACTER STRINGS 



FIG. 9

COMPARISON CRITERION TABLE

ITEM NO.   TAG                    TYPE OF COMPARISON CRITERION
1          <NAME>                 IDENTITY TAG
2          <TRANSMISSION DATE>    IDENTITY TAG
FIG. 10A

1001

<NAME> TARO HEISEI </ NAME>
<TEXT>
HELLO.

FIG. 10B

1002

<MEMO>
<TRANSMISSION DATE> NOVEMBER 20, SIXTH YEAR OF HEISEI </ TRANSMISSION DATE>
<TEXT>
HELLO. ARE YOU FINE ?
FIG. 10C

PRODUCTION PROCEDURE FOR DOCUMENT TREE OF FIG. 10A

START
401  ALLOCATE <MEMO>, A START TAG, TO A NODE
402  ALLOCATE <NAME>, AN IDENTITY TAG, TOGETHER WITH
     <NAME> TARO HEISEI </ NAME> TO A NODE AS CHILD OF <MEMO>
403  ALLOCATE <TEXT>, A START TAG, TO CHILD NODE OF <MEMO>
404  ALLOCATE HELLO, A CHARACTER STRING, TO CHILD NODE OF <TEXT>
405  ALLOCATE </TEXT>, AN END TAG, TO CHILD NODE OF <TEXT>
406  ALLOCATE </ MEMO>, AN END TAG, TO CHILD NODE OF <MEMO>
FIG. 11A

<MEMO>

<NAME> TARO HEISEI </ NAME>

<TEXT>
HELLO.

</ TEXT>
</ MEMO>

FIG. 11B

<MEMO>

<TRANSMISSION DATE> NOVEMBER 20, SIXTH YEAR OF HEISEI

</ TRANSMISSION DATE>
<TEXT>

HELLO. ARE YOU FINE ?
</ TEXT>
</ MEMO>



UNDERLINED PARTS : DIFFERENCE CHARACTER STRINGS 



FIG. 12

COMPARISON CRITERION TABLE

ITEM NO.   TAG                 TYPE OF COMPARISON CRITERION
1          <AUTHOR NAME>       IDENTITY TAG
2          <CHAPTER NUMBER>    IGNORING TAG
FIG. 13A

<TREATISE>
    <AUTHOR NAME> TARO HEISEI </ AUTHOR NAME>
    <CHAPTER>
        STRUCTURED DOCUMENT DIFFERENCE EXTRACTION METHOD
        </CHAPTER>
    </TREATISE>

1301

FIG. 13B

<TREATISE>
    <AUTHOR NAME> TARO HEISEI </ AUTHOR NAME>
    <CHAPTER>
        WHAT IS STRUCTURED DOCUMENT ?
        </CHAPTER>
    <CHAPTER>
        STRUCTURED DOCUMENT DIFFERENCE EXTRACTION METHOD
        </CHAPTER>
    </TREATISE>

1302
FIG. 14A

<TREATISE>

<AUTHOR NAME> TARO HEISEI </ AUTHOR NAME>
<CHAPTER>

<CHAPTER NUMBER> CHAPTER 1 </ CHAPTER NUMBER>
STRUCTURED DOCUMENT DIFFERENCE EXTRACTION METHOD
</ CHAPTER>
</TREATISE>

FIG. 14B

<TREATISE>

<AUTHOR NAME> TARO HEISEI </ AUTHOR NAME>
<CHAPTER>

<CHAPTER NUMBER> CHAPTER 1 </ CHAPTER NUMBER>
WHAT IS STRUCTURED DOCUMENT ?
</CHAPTER>
<CHAPTER>

<CHAPTER NUMBER> CHAPTER 2 </ CHAPTER NUMBER>
STRUCTURED DOCUMENT DIFFERENCE EXTRACTION METHOD
</ CHAPTER>
</TREATISE>



UNDERLINED PARTS : DIFFERENCE CHARACTER STRINGS 



FIG. 15

COMPARISON CRITERION TABLE

ITEM NO.   TAG                       TYPE OF COMPARISON CRITERION
1          <AUTHOR NAME>             IDENTITY TAG
2          <ITEM>, <FIRST ITEM>      EQUIVALENCE TAG
FIG. 16A



<AUTHOR NAME> 

TARO HEISEI 
</ AUTHOR NAME> 



<TREATISE> 




</TREATISE> 



STRUCTURED 
DOCUMENT 
DIFFERENCE 
EXTRACTION 
METHOD 



</ITEM> 



FIG. 16B 



<TREATISE> 



<AUTHOR NAME> 

TARO HEISEI 
</ AUTHOR NAME> 



WHAT IS 
STRUCTURED 
DOCUMENT ? 




<ITEM> 



1603 



STRUCTURED 
DOCUMENT 
DIFFERENCE 
EXTRACTION 
METHOD 



</TREATISE> 



</ITEM> 
FIG. 17A



<TREATISE> 

<AUTHOR NAME> TARO HEISEI </ AUTHOR NAME> 
<FIRST ITEM> 

STRUCTURED DOCUMENT DIFFERENCE EXTRACTION METHOD 
</ FIRST ITEM> 
</TREATISE> 



FIG. 17B 



<TREATISE> 

<AUTHOR NAME> TARO HEISEI </ AUTHOR NAME> 
<FIRST ITEM> 

WHAT IS STRUCTURED DOCUMENT ? 
</ FIRST ITEM> 
<ITEM> 

STRUCTURED DOCUMENT DIFFERENCE EXTRACTION METHOD 
</ ITEM> 
</TREATISE> 



UNDERLINED PARTS : DIFFERENCE CHARACTER STRINGS 
FIG. 18A



<MEMO> 
<SENDER> 

<ORGANIZATION> OO BANK </ORGANIZATION> 

<NAME> TARO HEISEI </ NAME> 
</ SENDER> 
<TEXT> 

HELLO. 
</ TEXT> 
</MEMO> 



FIG. 18B 



<MEMO> 
<RECEIVER> 
<ORGANIZATION> XX BANK </ ORGANIZATION> 
<NAME> TARO SHOWA </ NAME> 
</ RECEIVER> 
<TEXT> 

HELLO. ARE YOU FINE ? 
</ TEXT> 
</ MEMO> 



FIG. 19 



COMPARISON CRITERION TABLE

ITEM NO.    TAG           TYPE OF COMPARISON CRITERION
1           <SENDER>      NO-COMPARISON TAG
2           <RECEIVER>    NO-COMPARISON TAG
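
One possible reading of the no-comparison criterion in FIG. 19 is sketched below. The function names and the record format are hypothetical. The assumption is that when two elements carry tags from the same no-comparison set, their contents are never compared character by character, so the whole old element is reported as deleted and the whole new element as inserted.

```python
# Rows 1 and 2 of FIG. 19, treated here as a single no-comparison set.
NO_COMPARISON_SETS = [frozenset({"SENDER", "RECEIVER"})]


def is_no_comparison_pair(tag_a, tag_b, groups=NO_COMPARISON_SETS):
    """True when two different tags belong to one no-comparison set."""
    return tag_a != tag_b and any(
        tag_a in group and tag_b in group for group in groups
    )


def whole_element_difference(before, after, groups=NO_COMPARISON_SETS):
    """Return delete/insert records for a no-comparison pair, else None.

    `before` and `after` are (tag, text) pairs. None means the pair is
    not blocked, so an ordinary content comparison would apply instead.
    """
    (tag_a, text_a), (tag_b, text_b) = before, after
    if is_no_comparison_pair(tag_a, tag_b, groups):
        return [("delete", tag_a, text_a), ("insert", tag_b, text_b)]
    return None


print(whole_element_difference(
    ("SENDER", "OO BANK TARO HEISEI"),
    ("RECEIVER", "XX BANK TARO SHOWA"),
))
```

Under this assumption, a shared word such as "TARO" in the sender and receiver names would not be reported as unchanged text, because the two elements are never aligned with each other.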
FIG. 20A




<SENDER>
    <ORGANIZATION> OO BANK </ORGANIZATION>
    <NAME> TARO HEISEI </NAME>
</SENDER>
<TEXT> HELLO. </TEXT>



FIG. 20B 




<RECEIVER>
    <ORGANIZATION> XX BANK </ORGANIZATION>
    <NAME> TARO SHOWA </NAME>
</RECEIVER>
<TEXT> HELLO. ARE YOU FINE ? </TEXT>
FIG. 21A



<MEMO>
<SENDER>
<ORGANIZATION> OO BANK </ORGANIZATION>
<NAME> TARO HEISEI </NAME>
</SENDER>
<TEXT>
HELLO.
</TEXT>
</MEMO>



FIG. 21 B 



<MEMO>
<RECEIVER>
<ORGANIZATION> XX BANK </ORGANIZATION>
<NAME> TARO SHOWA </NAME>
</RECEIVER>
<TEXT>
HELLO. ARE YOU FINE ?
</TEXT>
</MEMO>



UNDERLINED PARTS : DIFFERENCE CHARACTER STRINGS 
FIG. 22



503    READ DOCUMENT
504    PARSE LOGICAL STRUCTURE OF DOCUMENT
505    EXTRACT DIFFERENCE
506    OUTPUT STRUCTURED DIFFERENCE INFORMATION
507    DISPLAY AND STORE STRUCTURED DIFFERENCE INFORMATION
508    EDITING COMPLETED ? (YES)
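
The read, parse, and extract steps of this flowchart can be sketched as a small pipeline. This is a simplified illustration, not the parsing or difference extraction programs of the specification. The names `parse`, `leaves`, and `extract_difference` are hypothetical, elements are matched only by their tag path, and repeated sibling tags (such as two `<CHAPTER>` elements) are not handled. The two input strings are hand-typed copies of structured documents a and a'.

```python
import re

# One token is an opening tag, a closing tag, or a run of character data.
TOKEN = re.compile(r"<(/?)\s*([^<>]+?)\s*>|([^<>]+)")


def parse(text):
    """Parse step: turn tagged text into nested (tag, children) tuples."""
    root = ("ROOT", [])
    stack = [root]
    for match in TOKEN.finditer(text):
        closing, tag, chars = match.groups()
        if chars is not None:
            if chars.strip():
                stack[-1][1].append(chars.strip())
        elif closing:
            if len(stack) == 1 or stack[-1][0] != tag:
                raise ValueError(f"unexpected closing tag: {tag}")
            stack.pop()
        else:
            node = (tag, [])
            stack[-1][1].append(node)
            stack.append(node)
    if len(stack) != 1:
        raise ValueError(f"unclosed tag: {stack[-1][0]}")
    return root[1]


def leaves(nodes, prefix=""):
    """Map each tag path (e.g. /MEMO/TEXT) to the text found under it."""
    out = {}
    for node in nodes:
        if isinstance(node, str):
            out[prefix] = node
        else:
            tag, children = node
            out.update(leaves(children, f"{prefix}/{tag}"))
    return out


def extract_difference(before, after):
    """Extract step: list (kind, path, old_text, new_text) records."""
    old, new = leaves(parse(before)), leaves(parse(after))
    diffs = []
    for path, text in new.items():
        if path not in old:
            diffs.append(("inserted", path, None, text))
        elif old[path] != text:
            diffs.append(("changed", path, old[path], text))
    for path, text in old.items():
        if path not in new:
            diffs.append(("deleted", path, text, None))
    return diffs


DOC_A = ("<MEMO><TRANSMITTER><NAME> TARO HEISEI </ NAME></ TRANSMITTER>"
         "<TEXT> HELLO. </ TEXT></MEMO>")
DOC_A_PRIME = ("<MEMO><TRANSMITTER><NAME> JIRO SHOWA </NAME>"
               "<ORGANIZATION> ABC COMPANY </ ORGANIZATION></ TRANSMITTER>"
               "<TEXT> HELLO. ARE YOU FINE ? </ TEXT></MEMO>")

diffs = extract_difference(DOC_A, DOC_A_PRIME)
for record in diffs:
    print(record)
```

Matching by path keeps the sketch short. A criterion-driven matcher would replace the plain `path not in old` test with a lookup in the comparison criterion table.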



EP 0 747 836 A1 



FIG. 23A 



<MEMO> 

<TRANSMITTER> 

<NAME> TARO HEISEI </ NAME> 
</ TRANSMITTER> 
<TEXT> 

HELLO. 
</ TEXT> 
</MEMO> 



STRUCTURED DOCUMENT a 



FIG. 23B 



<MEMO> 

<TRANSMITTER> 

<NAME> JIRO SHOWA </NAME> 

<ORGANIZATION> ABC COMPANY </ ORGANIZATION> 
</ TRANSMITTER> 
<TEXT> 

HELLO. ARE YOU FINE ? 
</ TEXT> 
</MEMO> 



STRUCTURED DOCUMENT a' 



FIG. 24A  EXAMPLE OF DIFFERENCE DATA

ITEM NO.    DIFFERENCE DATA
1           "TARO HEISEI" IN <NAME> CHANGED TO "JIRO SHOWA"
2           CHILD STRUCTURE "<ORGANIZATION> ABC COMPANY </ORGANIZATION>" OF <SENDER> INSERTED BEHIND STRUCTURE <NAME>
3           "ARE YOU FINE ?" IN <TEXT> INSERTED AT SEVENTH CHARACTER
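
The character position in item 3 can be reproduced with a generic sequence matcher. The sketch below uses Python's standard `difflib` module as a stand-in; the specification does not say which string comparison algorithm is used, and the helper name `char_insertions` is hypothetical.

```python
import difflib


def char_insertions(old, new):
    """Return (1-based position, inserted text) for each insertion."""
    matcher = difflib.SequenceMatcher(None, old, new, autojunk=False)
    return [
        (i1 + 1, new[j1:j2])
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
        if op == "insert"
    ]


result = char_insertions("HELLO.", "HELLO. ARE YOU FINE ?")
print(result)  # [(7, ' ARE YOU FINE ?')]
```

"HELLO." is six characters long, so the inserted string begins at the seventh character position, in agreement with item 3.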



FIG. 24B 



COMPARATIVE EXAMPLE OF DIFFERENCE DATA DISPLAY

CHANGE PORTIONS

<MEMO>
<SENDER>
<NAME> JIRO SHOWA </NAME>
<ORGANIZATION> ABC COMPANY </ORGANIZATION>
</SENDER>
<TEXT>
HELLO. ARE YOU FINE ?
</TEXT>
</MEMO>



FIG. 25 EXAMPLE OF STRUCTURED DIFFERENCE DATA 



<MEMO>
<TRANSMITTER>
<NAME>
<BEFORE CHANGE> TARO HEISEI </BEFORE CHANGE>
<AFTER CHANGE> JIRO SHOWA </AFTER CHANGE>
</NAME>
<ORGANIZATION diffflag = INSERTION> ABC COMPANY
</ORGANIZATION>
</TRANSMITTER>
<TEXT>
HELLO. <INSERTION> ARE YOU FINE ? </INSERTION>
</TEXT>
</MEMO>
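
The three kinds of markup used in FIG. 25 (a before/after pair, an inserted element flagged with `diffflag`, and an inline `<INSERTION>` span) can be generated with simple string helpers. The helper names below are hypothetical, and the exact whitespace of the output is a choice made for this sketch rather than something the figure prescribes.

```python
def mark_change(tag, before, after):
    """Wrap old and new text of a changed element in before/after tags."""
    return (f"<{tag}> <BEFORE CHANGE> {before} </BEFORE CHANGE> "
            f"<AFTER CHANGE> {after} </AFTER CHANGE> </{tag}>")


def mark_inserted_element(tag, text):
    """Flag a newly inserted element with a diffflag attribute."""
    return f"<{tag} diffflag = INSERTION> {text} </{tag}>"


def mark_text_insertion(old_text, position, inserted):
    """Embed an inserted string at a 1-based character position."""
    index = position - 1
    return (old_text[:index] + "<INSERTION>" + inserted + "</INSERTION>"
            + old_text[index:])


name = mark_change("NAME", "TARO HEISEI", "JIRO SHOWA")
organization = mark_inserted_element("ORGANIZATION", "ABC COMPANY")
text = mark_text_insertion("HELLO.", 7, " ARE YOU FINE ?")
print(name)
print(organization)
print(text)
```

Because the difference information is itself tagged text, it can be stored and displayed with the same tools as the structured documents it describes.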



FIG. 26A 



STRUCTURED DOCUMENT a.SGM

MEMO
    TRANSMITTER
        NAME            TARO HEISEI
    TEXT                HELLO.

2301 2302



FIG. 26B 



STRUCTURED DOCUMENT a'.SGM

MEMO
    TRANSMITTER
        NAME            JIRO SHOWA
        ORGANIZATION    ABC COMPANY
    TEXT                HELLO. ARE YOU FINE ?
FIG. 27



STRUCTURED DOCUMENT DIFFERENCE DATA .SGM

MEMO
    TRANSMITTER
        NAME            TARO HEISEI JIRO SHOWA
        ORGANIZATION    ABC COMPANY
    TEXT                HELLO. ARE YOU FINE ?